Error Backpropagation

Overview

Backpropagation is the key algorithm for training neural networks. It is an efficient application of the chain rule of multivariate calculus across multiple layers of neurons. It was first described in Paul Werbos's 1974 PhD thesis and later popularized by David Rumelhart, Geoffrey Hinton, and Ronald Williams in their 1986 paper.

Mathematical Formulation

Here’s how the math works conceptually: in the forward pass, each layer computes a weighted sum then applies an activation function. In backpropagation, we work backwards using the chain rule — computing an “error signal” (delta, \(\delta\)) at each neuron, then using those deltas to get the gradient for every weight.

Here’s a summary of the four phases shown in the interactive demo:

Network setup used: input \(x = 0.5\), target \(y = 1.0\), sigmoid activations on hidden layers, linear output for regression, MSE loss.

① Forward pass — compute left to right. Each neuron computes \(z = W \cdot a + b\) (weighted sum + bias), then \(a = \sigma(z)\) for hidden units, or \(\hat{y} = z\) for the output. The network predicts \(\hat{y} \approx 0.3890\) against a target of 1.0.
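As a concrete sketch, here is that forward pass for a tiny 1-2-1 network of the same shape in NumPy. The weights and biases below are made-up illustrative values (the demo's actual parameters aren't listed), so the prediction will not match 0.3890:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical parameters for a 1-2-1 network (assumed, not the demo's values)
x = np.array([0.5])                                        # input
W1 = np.array([[0.4], [-0.6]]); b1 = np.array([0.1, 0.2])  # hidden layer: 2 units
W2 = np.array([[0.7, -0.3]]);   b2 = np.array([0.05])      # output layer: 1 unit

z1 = W1 @ x + b1    # weighted sums of the hidden units
a1 = sigmoid(z1)    # sigmoid activations for hidden units
z2 = W2 @ a1 + b2   # weighted sum at the output
y_hat = z2          # linear output for regression
```

With these assumed weights the network predicts \(\hat{y} \approx 0.3096\); the structure of the computation is the point, not the numbers.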

② Loss — measure the error. \(L = (\hat{y} - y)^2 = 0.3733\). The derivative \(\frac{\partial L}{\partial z}\) at the output \(= 2(\hat{y} - y) = -1.2220\), which seeds the backward pass.

③ Deltas (\(\delta\)) — propagate error backward. The key formula is:

\[ \delta^{(l)}_j = \left[\sum_k W^{(l+1)}_{kj} \cdot \delta^{(l+1)}_k\right] \cdot \sigma'(z_j) \]

Each neuron’s delta = “how much error flows back through the weights connecting it forward” × “its local slope \(\sigma'(z)\)”. The \(W^T \cdot \delta\) part is what makes it efficient — you reuse the same weights you used going forward, just transposed.
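In vectorized form, the delta step is one transposed matrix-vector product followed by an elementwise multiply with the local slopes. The values below are illustrative placeholders: the seed delta \(-1.222\) comes from step ②, while the weights and pre-activations are assumed:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

# Illustrative values (not the demo's): delta of layer l+1, weights into l+1,
# and the pre-activations z of layer l.
delta_next = np.array([-1.222])    # delta vector of layer l+1 (one output unit)
W_next = np.array([[0.7, -0.3]])   # weights connecting layer l to layer l+1
z = np.array([0.3, -0.1])          # pre-activations of layer l

# The key formula: delta_l = (W_{l+1}^T @ delta_{l+1}) * sigma'(z_l)
delta = (W_next.T @ delta_next) * sigmoid_prime(z)
```

Note that `W_next` is the same array used in the forward pass; only the transpose is new, which is exactly the reuse that makes backpropagation cheap.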

④ Gradients (\(\frac{\partial L}{\partial W}\)) — compute weight updates. Once you have the deltas:

\[ \frac{\partial L}{\partial W^{(l)}_{ji}} = \delta^{(l)}_j \cdot a^{(l-1)}_i \]

This is an outer product: each weight’s gradient equals its downstream error signal (\(\delta\)) times the upstream activation (\(a\)) that flowed through it. Then update: \(W \leftarrow W - \eta \cdot \frac{\partial L}{\partial W}\).
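Putting the gradient and the update together for one layer, a minimal NumPy sketch with assumed values (the deltas and weights here are illustrative, not the demo's):

```python
import numpy as np

# Illustrative values: deltas of layer l and activations of layer l-1
delta = np.array([-0.2091, 0.0914])   # error signals of the two hidden units
a_prev = np.array([0.5])              # upstream activation (the input here)
eta = 0.1                             # learning rate

# Gradient: outer product of downstream deltas and upstream activations
grad_W = np.outer(delta, a_prev)      # shape (2, 1), same shape as the weights

# Gradient-descent update: W <- W - eta * dL/dW
W1 = np.array([[0.4], [-0.6]])        # assumed current weights
W1 = W1 - eta * grad_W
```

The outer product makes the shape bookkeeping automatic: a layer with \(J\) units fed by \(I\) activations gets a \(J \times I\) gradient, matching its weight matrix.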

Derivation

Dependency flow: A change \(\Delta z_j^{(L)}\) in the weighted input to neuron \(j\) in layer \(L\) propagates forward to affect all neurons \(k\) in layer \(L+1\):

flowchart TD
    subgraph LA["Layer L"]
        D["Δz_j"]
    end
    subgraph LB["Layer L+1"]
        D1["Δz_1"]
        D2["Δz_2"]
        D3["..."]
        DK["Δz_K"]
    end
    D --> D1
    D --> D2
    D --> D3
    D --> DK

Weighted sum for neuron \(k\) in layer \(L+1\): \[ z_k^{(L+1)} = \sum_{j=1}^{J} w_{kj}^{(L+1)} \sigma(z_j^{(L)}) = \sum_{j=1}^{J} w_{kj}^{(L+1)} a_j^{(L)} \] where \(a_j^{(L)} = \sigma(z_j^{(L)})\).

Error signal propagation: Using \(\frac{\partial L}{\partial z_j^{(L)}} = \delta_j^{(L)}\) and the chain rule: \[ \delta_j^{(L)} = \sum_{k=1}^{K} \frac{\partial L}{\partial z_k^{(L+1)}} \frac{\partial z_k^{(L+1)}}{\partial z_j^{(L)}} = \sum_{k=1}^{K} \delta_k^{(L+1)} \frac{\partial z_k^{(L+1)}}{\partial z_j^{(L)}} \]

Partial derivative (how \(z_k^{(L+1)}\) changes w.r.t. \(z_j^{(L)}\)): \[ \frac{\partial z_k^{(L+1)}}{\partial z_j^{(L)}} = w_{kj}^{(L+1)} \sigma'(z_j^{(L)}) \]

Substituting: \[ \delta_j^{(L)} = \sum_{k=1}^{K} \delta_k^{(L+1)} w_{kj}^{(L+1)} \sigma'(z_j^{(L)}) \]

Weighted sum for neuron \(j\) in layer \(L\): \[ z_j^{(L)} = \sum_{i=1}^{I} w_{ji}^{(L)} a_i^{(L-1)} \] where \(a_i^{(L-1)} = \sigma(z_i^{(L-1)})\) is the activation from layer \(L-1\).

Gradient for weight \(w_{ji}^{(L)}\): \[ \frac{\partial z_j^{(L)}}{\partial w_{ji}^{(L)}} = a_i^{(L-1)} \] \[ \frac{\partial L}{\partial w_{ji}^{(L)}} = \delta_j^{(L)} \cdot a_i^{(L-1)} = \left( \sum_{k} \delta_k^{(L+1)} w_{kj}^{(L+1)} \right) \sigma'(z_j^{(L)}) \, a_i^{(L-1)} \]
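A standard way to sanity-check the derived formula is a finite-difference gradient check: perturb each weight slightly, recompute the loss, and compare the numerical slope with \(\delta_j \cdot a_i\). Here is a sketch for a hypothetical 1-2-1 network with random weights (same shapes as the walkthrough, values assumed):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(W1, b1, W2, b2, x, y):
    """MSE loss of the 1-2-1 network: sigmoid hidden layer, linear output."""
    a1 = sigmoid(W1 @ x + b1)
    d = (W2 @ a1 + b2)[0] - y
    return d * d

# Hypothetical network: random parameters, fixed seed for reproducibility
rng = np.random.default_rng(0)
x, y = np.array([0.5]), 1.0
W1, b1 = rng.normal(size=(2, 1)), rng.normal(size=2)
W2, b2 = rng.normal(size=(1, 2)), rng.normal(size=1)

# Analytic gradient via backpropagation
z1 = W1 @ x + b1; a1 = sigmoid(z1); y_hat = W2 @ a1 + b2
delta_out = 2 * (y_hat - y)                     # dL/dz at the linear output
delta1 = (W2.T @ delta_out) * (a1 * (1 - a1))   # dL/dz at the hidden layer
grad_W1 = np.outer(delta1, x)                   # dL/dW1 = delta_j * a_i

# Numerical gradient by central finite differences
eps = 1e-6
grad_num = np.zeros_like(W1)
for j in range(W1.shape[0]):
    for i in range(W1.shape[1]):
        Wp, Wm = W1.copy(), W1.copy()
        Wp[j, i] += eps
        Wm[j, i] -= eps
        grad_num[j, i] = (loss(Wp, b1, W2, b2, x, y)
                          - loss(Wm, b1, W2, b2, x, y)) / (2 * eps)

# The two gradients should agree to high precision
print(np.max(np.abs(grad_W1 - grad_num)))
```

If the derivation is correct, the maximum discrepancy is tiny (limited only by floating-point error in the finite differences).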

Layer connectivity (indices used in backpropagation):

flowchart LR
    subgraph L1["Layer L-1"]
        i((i))
    end
    subgraph L2["Layer L"]
        j((j))
    end
    subgraph L3["Layer L+1"]
        k((k))
    end
    i -->|w_ji| j
    j -->|w_kj| k

Neuron \(j\) in layer \(L\) has activation \(a_j = \sigma(z_j)\). Weight \(w_{ji}\) connects neuron \(i\) (layer \(L-1\)) to neuron \(j\); weight \(w_{kj}\) connects neuron \(j\) to neuron \(k\) (layer \(L+1\)).